Slurm job management¶
This page is a practical command reference for managing jobs once you already know the basics from the Slurm quick guide.
For partition defaults and limits, see Advanced partitions.
Job lifecycle¶
1) Prepare a batch script¶
Use the template provided in your home directory:
cp ~/slurm-prod10.sbatch ./job.sbatch
nano job.sbatch
At minimum, set:
- partition (#SBATCH --partition=...)
- walltime (#SBATCH --time=...)
- the Python command to run
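Putting those settings together, a minimal batch script might look like the sketch below (the job name, walltime, and Python command are placeholders; adjust them to your run):

```shell
#!/bin/bash
#SBATCH --job-name=myrun          # hypothetical job name, shown in squeue
#SBATCH --partition=prod10        # required on this cluster
#SBATCH --time=01:00:00           # walltime, HH:MM:SS
#SBATCH --output=slurm-%j.out     # %j expands to the job ID

# Replace with your actual Python command
python my_script.py
```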
2) Submit the job¶
sbatch job.sbatch
You will get a job ID, for example:
Submitted batch job 29509
3) Monitor queue and status¶
squeue -u $USER
scontrol show job <jobid>
Useful job states:
- PD: pending
- R: running
- CG: completing
- CD: completed
- F: failed
- TO: timed out
4) Check accounting/history¶
sacct -j <jobid> --format=JobID,JobName,Partition,State,Elapsed,ExitCode
5) Read logs¶
By default, output goes to slurm-<jobid>.out (or your custom --output/--error paths).
ls -lh slurm-<jobid>.out
tail -n 100 slurm-<jobid>.out
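If you prefer per-job log files in a dedicated directory, you can set custom paths in the batch script. The logs/ directory and naming pattern below are just one possible convention; %x expands to the job name and %j to the job ID:

```shell
#SBATCH --output=logs/%x-%j.out   # stdout
#SBATCH --error=logs/%x-%j.err    # stderr (goes to --output if omitted)
```

Create the logs/ directory before submitting: Slurm does not create missing output directories, and the job will fail if the path does not exist.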
6) Cancel a job¶
scancel <jobid>
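Beyond cancelling a single job, scancel accepts standard filters for cancelling in bulk (the job name below is hypothetical):

```shell
scancel <jobid>                    # cancel one job
scancel -u $USER                   # cancel all of your jobs
scancel -u $USER --state=PENDING   # cancel only your pending jobs
scancel --name=myrun               # cancel jobs matching a job name
```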
Fairshare and priority¶
When resources are busy, scheduling priority is influenced by fairshare (your group's recent usage) and other multifactor scheduler weights such as job age and size. Inspect them with:
sshare -l
sprio
Advanced patterns¶
Job arrays¶
Use arrays for many independent runs:
sbatch --array=0-31 job.sbatch
sbatch --array=1,3,5,7 job.sbatch
sbatch --array=1-7:2 job.sbatch
Inside the script, use the SLURM_ARRAY_TASK_ID environment variable to select each task's work item.
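A sketch of an array-aware script, assuming one input file per task (the input_<N>.dat naming is hypothetical; the :-0 default only exists so the script also runs outside Slurm):

```shell
#!/bin/bash
#SBATCH --partition=prod10        # adjust to your partition
#SBATCH --time=00:30:00
#SBATCH --array=0-31              # 32 independent tasks

# Slurm sets SLURM_ARRAY_TASK_ID for each task (0..31 here);
# the :-0 default lets the script run outside Slurm for testing.
TASK_ID=${SLURM_ARRAY_TASK_ID:-0}
INPUT="input_${TASK_ID}.dat"      # hypothetical per-task input naming
echo "Task ${TASK_ID} processing ${INPUT}"
```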
Job dependencies (chaining)¶
sbatch step1.sbatch
sbatch --dependency=afterok:74698 step2.sbatch
sbatch --dependency=afterok:74698:74699 step3.sbatch
Common rules:
- after:<jobid>: start once the listed job has started
- afterany:<jobid>: start once the listed job has terminated (any exit state)
- afterok:<jobid>: start once the listed job has completed successfully
- afternotok:<jobid>: start once the listed job has failed
- singleton: start once any earlier job with the same name and user has finished
If a dependency can never be satisfied (for example, afterok on a job that failed), the dependent job stays pending; cancel such stuck jobs with scancel.
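Rather than copying job IDs by hand, you can capture them in a wrapper script with --parsable, which makes sbatch print only the job ID (the step*.sbatch files are those from the example above):

```shell
#!/bin/bash
# Capture each job ID and build the dependency chain automatically.
jid1=$(sbatch --parsable step1.sbatch)
jid2=$(sbatch --parsable --dependency=afterok:${jid1} step2.sbatch)
sbatch --dependency=afterok:${jid1}:${jid2} step3.sbatch
```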
Notes specific to this DGX¶
- With QoS normal, you can run up to 4 jobs at the same time.
- With QoS normal, only 2 running jobs are allowed across prod40 + prod80.
- Partition is required in submissions.
- prod10, prod40, prod80 are batch-oriented (use sbatch).
- Use interactive10 with srun for interactive GPU debugging.
- If you need more resources (or another QoS policy), contact support: dgx_support@listes.centralesupelec.fr.